Abstract

Distributions over exchangeable matrices with infinitely many columns,
such as the Indian buffet process, are useful in constructing nonparametric
latent variable models. However, the distribution implied by such models
over the number of features exhibited by each data point may be poorly
suited for many modeling tasks. In this paper, we propose a class of
exchangeable nonparametric priors obtained by restricting the domain of
existing models. Such models allow us to specify the distribution over the
number of features per data point, and can achieve better performance on
data sets where the number of features is not well-modeled by the original
distribution.

1 Introduction

The Indian buffet process (IBP) [9] and the related infinite gamma-Poisson process
(iGaP) [14] are distributions over matrices with exchangeable rows and infinitely
many columns, only a finite (but random) number of which contain any
non-zero entries. Such distributions have proved useful for constructing flexible
latent factor models that do not require us to specify the number of latent factors
a priori. In such models, each column of the random matrix corresponds to
a latent feature, and each row to a data point. The non-zero elements of a row 
select the subset of features that contribute to the corresponding data point. 

However, distributions such as the IBP and the iGaP make certain assump- 
tions about the structure of the data that may be inappropriate. Specifically, 
such priors impose distributions on the number of data points that exhibit a 
given feature, and on the number of features exhibited by a given data point. 
For example, in the IBP, the number of features exhibited by a data point is 
marginally Poisson-distributed, and a feature appears in a new data point with
probability m/(N + 1), where N is the number of previously seen data points, 
and m is the number of times that feature has appeared. 

These properties may be too constraining for many modeling tasks. There 
are a number of cases where we might want to increase the flexibility of these 
models by allowing non-Poisson marginals over the number of latent features per 
data point, or by adding constraints on the number of features. For example, 
the IBP has been used to select possible next states in a hidden Markov model 
[7]. In such a model, we do not expect to see a state that allows no transitions 
(including self-transitions). Nonetheless, because a data point in the IBP can 
have zero features with non-zero probability, our prior supports states with no 
valid transition distribution. Similarly, the iGaP has been used to model features 
in images [14], and we may wish to exclude the possibility of a featureless image.

One interesting example arises when we expect, or desire, the latent features 
to correspond to interpretable features, or causes, of the data [TB]. We might 
believe that each data point exhibits exactly K features - corresponding perhaps 
to speakers in a dialog, members of a team, or alleles in a genotype - but 
be agnostic about the total number of features in our dataset. A model that 
explicitly encodes this prior expectation about the number of features per data 
point will tend to lead to more interpretable and parsimonious results. 

In other situations, we may believe that the number of features per data 
point follows a distribution other than that implied by the IBP. For example, 
it is well known that text and network data tend to exhibit power-law behavior,
suggesting a need for models that impose heavy-tailed distributions on the
number of features. 

In the case of the IBP, two- and three-parameter extensions have been pro- 
posed that modify the distribution over the number of data points that exhibit 
a feature [T31 HI [H] . While these extensions increase flexibility in the distribu- 
tions over the number of data points exhibiting each feature, the distribution 
over the number of features per data point remains Poisson. As we will see, 
this is an inherent consequence of the use of a completely random measure as 
both prior and likelihood. In this paper, we consider methods for varying the 
distribution over the number of features, by removing the completely random 
assumption. 

2 Exchangeability 

We say a finite sequence $(X_1, \ldots, X_N)$ is exchangeable (see, for example, [1]) if
its distribution is unchanged under any permutation $\sigma$ of $\{1, \ldots, N\}$. Further,
we say that an infinite sequence $X_1, X_2, \ldots$ is infinitely exchangeable if all of its
finite subsequences are exchangeable. Such distributions are appropriate when 
we do not believe the order in which we see our data is important, or when we 
do not have access to all data points. 

De Finetti's law tells us that an infinite sequence is exchangeable iff the observations
are i.i.d. given some latent distribution. This means that we can write the
probability of any infinitely exchangeable sequence as

$$P(X_1 = x_1, X_2 = x_2, \ldots) = \int_\Theta \prod_{i} \mu_\theta(X_i = x_i \mid \theta)\, \nu(\theta)\, d\theta \qquad (1)$$

for some probability distribution $\nu$ over the parameter space $\Theta$, and some parametrised
family $\{\mu_\theta\}_{\theta \in \Theta}$ of conditional probability distributions.

Throughout this paper, we will use the notation $p(x_1, x_2, \ldots) = P(X_1 = x_1, X_2 = x_2, \ldots)$
to represent the joint distribution over an exchangeable sequence
$x_1, x_2, \ldots$, and $p(x_{N+1} \mid x_1, \ldots, x_N)$ to represent the associated predictive
distribution. We will also use the notation
$p(x_1, \ldots, x_N, \theta) := \prod_{i=1}^{N} \mu_\theta(X_i = x_i \mid \theta)\, \nu(\theta)$
to represent the joint distribution over the observations and the directing
measure $\theta$. In general $\theta$ may be infinite dimensional, which motivates
the close link between the exchangeability assumption and the need for Bayesian 
nonparametric models. 

2.1 Distributions over exchangeable matrices 

The Indian buffet process (IBP) [5] is a distribution over binary matrices with
exchangeable rows and infinitely many columns. In the de Finetti representation,
the mixing measure $\nu$ is a beta process, and the conditional distribution
$\mu_\theta$ is a Bernoulli process [13]. The beta process and the Bernoulli process are
both completely random measures (CRMs): distributions over random measures that
assign independent masses to disjoint subsets, and that can be written in the form
$\sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$ [H]. In the parametrization of the beta process commonly
used for the IBP, the masses $\pi_k$ of the atoms of a sample from a beta process
can be seen as the infinitesimal limit of $\mathrm{Beta}(\alpha\, dH_0, 1 - \alpha\, dH_0)$ random variables,
for some positive scalar $\alpha$ and CDF $H_0$. The masses of the atoms of a sample
from a Bernoulli process are distributed according to $\mathrm{Bernoulli}(dG_0)$, for some
piecewise-constant function $G_0$ taking values in $[0, 1]$, with an at most countable
number of jumps. In the context of the IBP, $G_0$ is the cumulative function of the
beta-process-distributed measure, so each atom of the beta process gives the
probability for a collection of Bernoulli random variables. We can think of the
atoms of the beta process as determining the latent probability for a column of
a matrix with infinitely many columns, and the Bernoulli process as sampling
binary values for the entries of that column of the matrix. The resulting matrix
has a finite number of non-zero entries, with the number of non-zero entries in
each row distributed as $\mathrm{Poisson}(\alpha)$ and the total number of non-zero columns in
$N$ rows distributed as $\mathrm{Poisson}(\alpha H_N)$, where $H_N$ is the $N$th harmonic number.
The number of rows with a non-zero entry for a given column exhibits a "rich 
gets richer" property - a new row has a one in a given column with probability 
proportional to the number of times a one has appeared in that column in the 
preceding rows. 
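
The properties above can be checked empirically. The following is a minimal sketch, assuming the standard sequential ("culinary") scheme for the one-parameter IBP, in which row $n$ switches on an existing column with probability $m_k / n$ and adds $\mathrm{Poisson}(\alpha / n)$ new columns. The function names, the seed, and the Monte Carlo sizes are illustrative choices, not taken from the paper.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_rows, alpha, rng):
    """Sample a binary matrix from the one-parameter IBP via its predictive scheme.

    Returns a list of rows; each row is the set of column indices equal to one.
    """
    column_counts = []  # m_k: number of earlier rows with a one in column k
    rows = []
    for n in range(1, num_rows + 1):
        row = set()
        # An existing column k is switched on with probability m_k / n.
        for k, m_k in enumerate(column_counts):
            if rng.random() < m_k / n:
                row.add(k)
        # Row n also introduces Poisson(alpha / n) previously unseen columns.
        for _ in range(sample_poisson(alpha / n, rng)):
            row.add(len(column_counts))
            column_counts.append(0)
        for k in row:
            column_counts[k] += 1
        rows.append(row)
    return rows


rng = random.Random(0)
alpha, num_rows, num_draws = 2.0, 10, 2000
harmonic = sum(1.0 / n for n in range(1, num_rows + 1))
draws = [sample_ibp(num_rows, alpha, rng) for _ in range(num_draws)]
# Average number of ones per row; should be close to alpha.
mean_row_sum = sum(len(row) for rows in draws for row in rows) / (num_draws * num_rows)
# Average number of non-empty columns; should be close to alpha * H_N.
mean_columns = sum(len(set().union(*rows)) for rows in draws) / num_draws
```

Under these assumptions, `mean_row_sum` should be close to $\alpha$ and `mean_columns` close to $\alpha H_N$, up to Monte Carlo error.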

Several models have been formulated that allow us to vary the distribution 
over the total number of features and the degree to which features are shared 
between data points. A two-parameter extension of the IBP (TUl US] can be 
obtained by introducing an extra parameter to the beta process, so that the
column probabilities are distributed according to the infinitesimal limit of a
$\mathrm{Beta}(c\alpha\, dH_0, c(1 - \alpha\, dH_0))$ distribution. The parameter $c$ controls the degree of
sharing of the features in the resulting IBP: as $c \to 0$, all data points share the
same features, and as $c \to \infty$, all data points have disjoint feature sets. A
three-parameter extension [12] replaces the beta process with a completely random
measure called the stable-beta process, which includes the beta process as a
special case. The resulting IBP exhibits power-law behavior: the total number
of features exhibited in a dataset of size $N$ grows as $O(N^s)$ for some $s > 0$, and
the number of data points exhibiting each feature also follows a power law.

A related distribution over exchangeable matrices is the infinite gamma-Poisson
process (iGaP) [14]. Here, the de Finetti mixing measure is the gamma
process, and the family of conditional distributions is given by the Poisson 
process. The atoms of the gamma process correspond to the columns of a 
matrix, in a manner similar to the beta process in the IBP. In this case, the atoms 
determine the mean value of the column, and the Poisson process populates the 
column of the matrix with Poisson random variables with this mean. The result 
is a distribution over non-negative integer-valued matrices with infinitely many 
columns and exchangeable rows. The sum of each row is distributed according 
to a negative binomial distribution. 

3 Removing the Poisson assumption 

In Section 2.1 we saw that, while existing methods are able to alter the degree
of sharing of features and the total number of features in the IBP, they have
not been able to remove the Poisson assumption on the number of features per
data point. This is noted by Teh and Görür, who point out:

One aspect of the [three-parameter IBP] which is not power-law is 
the number of dishes each customer tries. This is simply $\mathrm{Poisson}(\alpha)$
distributed. It seems difficult to obtain power-law behavior in this 
aspect within a CRM framework, because of the fundamental role 
played by the Poisson process. 

To elaborate on this, note that, marginally, the distribution over the value
of each element $z_k$ of a row $z$ of the IBP is given by a Bernoulli distribution.
Therefore, by the law of rare events, the sum $\sum_k z_k$ is distributed according
to a Poisson distribution. A similar argument applies to the infinite gamma-
Poisson process. In general, any distribution over exchangeable random matrices 
based on a homogeneous CRM will have rows marginally distributed as i.i.d. 
random variables. In the case of binary matrices, these random variables must 
be Bernoulli, so their sum will either be Poisson, or infinite. Therefore, in order 
to circumvent the requirement of a Poisson number of features in an IBP-like 
model, we must remove the completely random assumption on either the de 
Finetti mixing measure or the family of conditional distributions. 
3.1 Restricting the family of conditional distributions

We are familiar with the idea of restricting the support of a distribution to
a measurable subset. For example, a truncated Gaussian is a Gaussian distribution
restricted to a certain contiguous section of the real line. In general,
we can restrict the support of an arbitrary probability distribution $\mu$ on
some space $\Omega$ to a measurable subset $A \subset \Omega$ in the support of $\mu$ by defining
$\mu^{|A}(\cdot) := \mu(\cdot)\, I(\cdot \in A) / \mu(A)$, where $I(\cdot)$ is the indicator function.

Theorem 1 (Restricted exchangeable distributions). We can always restrict
the support of an exchangeable distribution on some space by restricting the
family of conditional distributions $\{\mu_\theta\}_{\theta \in \Theta}$ introduced in Equation 1, to obtain
an exchangeable distribution on the restricted space.

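Proof. Consider an unrestricted exchangeable model with de Finetti representation
$p(x_1, \ldots, x_N, \theta) = \prod_{i=1}^{N} \mu_\theta(X_i = x_i)\, \nu(\theta)$. Let $p^{|A}$ be the restriction of $p$
such that $X_i \in A$, $i = 1, 2, \ldots$, obtained by restricting the family of conditional
distributions $\{\mu_\theta\}$ to $\{\mu_\theta^{|A}\}$ as described above. Then

$$p^{|A}(x_1, \ldots, x_N \mid \theta) = \prod_{i=1}^{N} \frac{\mu_\theta(X_i = x_i)\, I(x_i \in A)}{\mu_\theta(X_i \in A)}$$

and

$$p^{|A}(x_{N+1} \mid x_1, \ldots, x_N) \propto \int_\Theta \frac{\prod_{i=1}^{N+1} \mu_\theta(X_i = x_i)\, I(x_i \in A)}{\prod_{i=1}^{N+1} \mu_\theta(X_i \in A)}\, \nu(\theta)\, d\theta \qquad (2)$$

is an exchangeable sequence by construction, according to de Finetti's law. □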

We give two examples based on the IBP. 

Example 1 (Restriction to a fixed number of non-zero entries per row). Recall
that, conditioned on a latent beta process-distributed measure $B := \sum_k \pi_k \delta_{\theta_k}$,
a sample from the IBP is distributed according to a Bernoulli process. This
distribution has support in $\{0, 1\}^\infty$. We can restrict the support of this Bernoulli
process to an arbitrary measurable subset $A \subset \{0, 1\}^\infty$, for example, the set of
all vectors $z \in \{0, 1\}^\infty$ such that $\sum_k z_k = S$ for some integer $S$. The conditional
distribution of a matrix $Z = \{z_1, \ldots, z_N\}$ under such a distribution is given by:

$$\mu_B^{|S}(Z) = \prod_{n=1}^{N} \frac{\prod_{k=1}^{\infty} \pi_k^{z_{nk}} (1 - \pi_k)^{1 - z_{nk}}\; I\!\left(\sum_k z_{nk} = S\right)}{\mathrm{PoiBin}(S \mid \{\pi_k\}_{k=1}^{\infty})} \qquad (3)$$

where $\mathrm{PoiBin}(\cdot \mid \{\pi_k\}_{k=1}^{\infty})$ is the infinite limit of the Poisson-binomial
distribution, which describes the distribution over the number of
successes in a sequence of independent but non-identical Bernoulli trials. The
probability of $Z$ given in Equation 3 is the infinite limit of the conditional
Bernoulli distribution, which describes the distribution of the locations of
the successes in such a trial, conditioned on their sum.
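
For a finite number of atoms, both distributions in Example 1 are straightforward to compute. The sketch below is an illustration under that finite (truncated) assumption: `poisson_binomial_pmf` uses a simple dynamic-programming recursion, and `sample_conditional_bernoulli` draws a row with a fixed sum by sampling entries one at a time. The function names and example probabilities are hypothetical, and this is not claimed to be the algorithm used by the authors.

```python
import random


def poisson_binomial_pmf(probs):
    """Return [P(sum = s) for s in 0..K] for independent Bernoulli(probs[k]) trials."""
    pmf = [1.0]
    for p in probs:
        extended = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            extended[s] += mass * (1.0 - p)  # trial fails
            extended[s + 1] += mass * p      # trial succeeds
        pmf = extended
    return pmf


def sample_conditional_bernoulli(probs, total, rng):
    """Sample z in {0,1}^K from independent Bernoulli(probs) trials given sum(z) == total.

    Assumes 0 < probs[k] < 1 and 0 <= total <= K.
    """
    num = len(probs)
    # tails[k][s] = P(z_k + ... + z_{K-1} = s) under the unrestricted trials.
    tails = [poisson_binomial_pmf(probs[k:]) for k in range(num + 1)]
    row, remaining = [], total
    for k in range(num):
        if remaining == num - k:
            bit = 1  # every remaining entry must be one
        elif remaining == 0:
            bit = 0  # no ones left to place
        else:
            # P(z_k = 1 | remaining ones among entries k..K-1)
            prob_one = probs[k] * tails[k + 1][remaining - 1] / tails[k][remaining]
            bit = 1 if rng.random() < prob_one else 0
        row.append(bit)
        remaining -= bit
    return row


rng = random.Random(1)
probs = [0.9, 0.5, 0.3, 0.1, 0.05]
row = sample_conditional_bernoulli(probs, 2, rng)
```

The returned `row` always contains exactly two ones; atoms with larger $\pi_k$ are selected more often, in proportion to the odds $\pi_k / (1 - \pi_k)$ when a single one is placed.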
Example 2 (Restriction to a random number of non-zero entries per row).
Rather than specify the number of non-zero entries in each row a priori, we can
allow it to be random, with some arbitrary distribution $f(\cdot)$ over the non-negative
integers. A Bernoulli process restricted to have $f$-marginals can be described as

$$\mu_B^{|f}(Z) = \prod_{n=1}^{N} f(S_n)\, \mu_B^{|S_n}(z_n) = \prod_{n=1}^{N} \left( \prod_{k=1}^{\infty} \pi_k^{z_{nk}} (1 - \pi_k)^{1 - z_{nk}} \right) \frac{f(S_n)}{\mathrm{PoiBin}(S_n \mid \{\pi_k\}_{k=1}^{\infty})} \qquad (4)$$

where $S_n = \sum_k z_{nk}$. Again, if we marginalize over $B$, the resulting distribution
is exchangeable, because mixtures of i.i.d. distributions are exchangeable.

We note that, even if we choose $f$ to be $\mathrm{Poisson}(\alpha)$, we will not recover
the IBP. The IBP has $\mathrm{Poisson}(\alpha)$ marginals over the number of non-zero elements
per row, but the conditional distribution is described by a Poisson-binomial
distribution. The Poisson-restricted IBP, however, will have Poisson marginal
and conditional distributions.

We also note that the fixed-row-sum model of Example 1 can be seen as a
special case of the random-distribution model of Example 2 where the distribution
$f$ is degenerate on $S$.
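
To see why the construction in Example 2 yields the chosen row-sum distribution, one can enumerate all rows for a small, finite set of atoms and sum the restricted probabilities by row sum. The following sketch assumes a truncation to five atoms with arbitrary illustrative probabilities and a hypothetical choice of $f$ (uniform on $\{1, 2, 3\}$).

```python
import itertools
import math


def poisson_binomial_pmf(probs):
    """Return [P(sum = s) for s in 0..K] for independent Bernoulli(probs[k]) trials."""
    pmf = [1.0]
    for p in probs:
        extended = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            extended[s] += mass * (1.0 - p)
            extended[s + 1] += mass * p
        pmf = extended
    return pmf


def restricted_row_prob(row, probs, f):
    """Single-row, finite-K version of Equation 4: Bernoulli product times f(S) / PoiBin(S)."""
    pmf = poisson_binomial_pmf(probs)
    row_sum = sum(row)
    bernoulli = math.prod(p if bit else 1.0 - p for bit, p in zip(row, probs))
    return bernoulli * f(row_sum) / pmf[row_sum]


def uniform_1_to_3(s):
    """Hypothetical row-sum distribution f: uniform on {1, 2, 3}."""
    return 1.0 / 3.0 if 1 <= s <= 3 else 0.0


probs = [0.7, 0.4, 0.2, 0.1, 0.05]
row_sum_marginal = [0.0] * (len(probs) + 1)
for row in itertools.product([0, 1], repeat=len(probs)):
    row_sum_marginal[sum(row)] += restricted_row_prob(row, probs, uniform_1_to_3)
```

After the loop, `row_sum_marginal[s]` equals $f(s)$ for every $s$, regardless of the atom probabilities: the Poisson-binomial term in the denominator cancels the row-sum distribution induced by the Bernoulli trials.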

Figure 1 shows some examples of samples from the single-parameter IBP,
with parameter $\alpha = 5$, with various restrictions applied.



[Figure 1 panels: IBP; 1 per row; 5 per row; 10 per row; Uniform{1, ..., 20}; Power-law (s = 2)]




Figure 1: Samples from restricted IBPs. 



3.2 Direct restriction of the predictive distributions 



The construction in Section 3.1 is explicitly conditioned on a draw $B$ from the
de Finetti mixing measure $\nu$. Since it might be cumbersome to explicitly represent
the infinite dimensional object $B$, it is tempting to consider constructions
that directly restrict the predictive distribution $p(X_{N+1} \mid X_1, \ldots, X_N)$, where $B$
has been marginalized out. In other words, can we simply sample from an 
exchangeable distribution and discard samples that fall outside our region of 
interest? 

We can certainly find examples of exchangeable sequences that remain ex- 
changeable after restricting their conditional distributions: 

Example 3 (Infinite gamma-Poisson process). Consider restricting the predictive
distribution of the infinite gamma-Poisson process such that each row
sums to $S$. In the predictive distribution for the iGaP, for each previously observed
feature $k$, we sample an element $x_{nk} \sim \mathrm{NegBinom}(m_k, n/(n+1))$. We
then sample a value $N^* \sim \mathrm{NegBinom}(\theta, n/(n+1))$ and assign $N^*$ counts to new
features according to a Chinese restaurant process. If we restrict this model such
that each row sums to 1, we have:

$$p^{|1}(x_{(N+1)k} = 1 \mid X_{1:N}) \propto p(x_{(N+1)k} = 1 \mid X_{1:N}) \prod_{j \neq k} p(x_{(N+1)j} = 0 \mid X_{1:N})$$

$$= \begin{cases} \dfrac{m_k}{\sum_j m_j + \theta} & \text{if feature } k \text{ has been seen before} \\[2ex] \dfrac{\theta}{\sum_j m_j + \theta} & \text{otherwise.} \end{cases}$$

In other words, the infinite gamma-Poisson process restricted to sum to one
is a Chinese restaurant process. If we restrict the iGaP to sum to $S$, we have $S$
samples per data point from a Chinese restaurant process.
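
The sum-to-$S$ case can be sketched directly from the Chinese restaurant process predictive rule. The code below is an illustration only: it assumes that the $S$ draws for a data point are made sequentially, with the counts $m_k$ updated after every draw, and the function name and parameter values are hypothetical.

```python
import random


def sample_restricted_igap_row(feature_counts, theta, row_sum, rng):
    """Draw one row whose entries sum to row_sum using row_sum CRP draws.

    feature_counts holds the running totals m_k and is updated in place.
    Returns a dict mapping feature index to its count in this row.
    """
    row = {}
    for _ in range(row_sum):
        total = sum(feature_counts)
        u = rng.random() * (total + theta)
        chosen = len(feature_counts)  # default: open a new feature (mass theta)
        cumulative = 0.0
        for k, m_k in enumerate(feature_counts):
            cumulative += m_k  # existing feature k has mass m_k
            if u < cumulative:
                chosen = k
                break
        if chosen == len(feature_counts):
            feature_counts.append(0)
        feature_counts[chosen] += 1
        row[chosen] = row.get(chosen, 0) + 1
    return row


rng = random.Random(2)
feature_counts = []
rows = [
    sample_restricted_igap_row(feature_counts, theta=1.5, row_sum=3, rng=rng)
    for _ in range(20)
]
```

Every row generated this way sums to exactly three, and popular features are reused with probability proportional to their accumulated counts.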

However, this result does not hold for direct restriction of arbitrary exchange- 
able sequences. 

Theorem 2 (Sequences obtained by directly restricting the predictive distribution
of an exchangeable sequence are not, in general, exchangeable). Let $p$ be
the distribution of the unrestricted exchangeable model introduced in the proof
of Theorem 1. Let $p^{*|A}$ be the distribution obtained by directly restricting this
unrestricted exchangeable model such that $X_n \in A$, i.e.



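$$p^{*|A}(x_{N+1} \mid x_1, \ldots, x_N) \propto \frac{I(x_{N+1} \in A) \int_\Theta \prod_{i=1}^{N+1} \mu_\theta(X_i = x_i)\, \nu(\theta)\, d\theta}{\int_\Theta \mu_\theta(X_{N+1} \in A) \prod_{i=1}^{N} \mu_\theta(X_i = x_i)\, \nu(\theta)\, d\theta}. \qquad (5)$$

In general, this will not be equal to Equation 1 and cannot be expressed as a
mixture of i.i.d. distributions.

Proof. To demonstrate that this is true, consider the counterexample given in
Example 4. □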

Example 4 (A three-urn buffet). Consider a simple form of the Indian buffet
process, with a base measure consisting of three unit-mass atoms. We can
represent the predictive distribution of such a model using three indexed urns, 
each containing one red ball (representing a one in the resulting matrix) and one 
blue ball (representing a zero in the resulting matrix). We generate a sequence 
of ball sequences by repeatedly picking a ball from each urn, noting the ordered 
sequence of colors, and returning the balls to their urns, plus one ball of each 
sampled color. 

Proposition 1. The three-urn buffet is exchangeable. 

Proof. By using the fact that a sequence is exchangeable iff the predictive distribution
given the first $N$ elements of the sequence of the $(N+1)$st and $(N+2)$nd
entries is exchangeable [6], it is trivial to show that this model is exchangeable
and that, for example,

$$p(X_{N+1} = (r,b,r), X_{N+2} = (r,r,b) \mid X_{1:N}) = \frac{m_1 m_2 (N + 1 - m_3)}{(N+1)^3} \cdot \frac{(m_1 + 1)(N + 1 - m_2)\, m_3}{(N+2)^3} = p(X_{N+1} = (r,r,b), X_{N+2} = (r,b,r) \mid X_{1:N}), \qquad (6)$$

where $m_i$ is the number of times in the first $N$ samples that the $i$th ball in a
sample has been red. □

Proposition 2. The directly restricted three-urn scheme (and, by extension, 
the directly restricted IBP) is not exchangeable. 

Proof. Consider the same scheme, but where the outcome is restricted such that
there is one, and only one, red ball per sample. The probability of a sequence
in this restricted model is given by

$$p^*(X_{N+1} = x \mid X_{1:N}) = \frac{\sum_{k=1}^{3} \frac{m_k}{N + 1 - m_k}\, I(x_k = r)}{\sum_{k=1}^{3} \frac{m_k}{N + 1 - m_k}}$$

and, for example,

$$p^*(X_{N+1} = (r,b,b), X_{N+2} = (b,r,b) \mid X_{1:N}) = \frac{\frac{m_1}{N + 1 - m_1}}{\sum_k \frac{m_k}{N + 1 - m_k}} \cdot \frac{\frac{m_2}{N + 2 - m_2}}{\frac{m_1 + 1}{N + 1 - m_1} + \frac{m_2}{N + 2 - m_2} + \frac{m_3}{N + 2 - m_3}} \neq p^*(X_{N+1} = (b,r,b), X_{N+2} = (r,b,b) \mid X_{1:N}), \qquad (7)$$

therefore the restricted model is not exchangeable. By introducing a normalizing
constant, corresponding to restricting over a subset of $\{0, 1\}^3$ that depends
on the previous samples, we have broken the exchangeability of the sequence.

By extension, a model obtained by directly restricting the predictive distri- 
bution of the IBP is not exchangeable. □ 
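
The counterexample can be verified with exact arithmetic. The sketch below implements the urn scheme of Example 4 as described (each urn starts with one red and one blue ball, and one ball of the sampled colour is added after every draw), tracking the urn contents directly rather than the $m_i$ notation. The function name and the particular sequences compared are illustrative choices.

```python
from fractions import Fraction
from itertools import permutations


def sequence_probability(samples, restricted):
    """Exact probability of a sequence of colour triples (1 = red, 0 = blue).

    With restricted=True, each predictive distribution is renormalised over
    the triples that contain exactly one red ball (direct restriction).
    """
    reds = [1, 1, 1]
    total = 2  # every urn always holds the same number of balls
    prob = Fraction(1)
    for sample in samples:
        if restricted and sum(sample) != 1:
            return Fraction(0)
        unrestricted = Fraction(1)
        for red_count, bit in zip(reds, sample):
            unrestricted *= Fraction(red_count if bit else total - red_count, total)
        if restricted:
            # Normaliser: total mass of the three one-red outcomes at this step.
            normaliser = Fraction(0)
            for k in range(3):
                one_hot = Fraction(1)
                for j, red_count in enumerate(reds):
                    one_hot *= Fraction(red_count if j == k else total - red_count, total)
                normaliser += one_hot
            prob *= unrestricted / normaliser
        else:
            prob *= unrestricted
        # Return the balls plus one extra ball of each sampled colour.
        reds = [r + bit for r, bit in zip(reds, sample)]
        total += 1
    return prob


# Two orderings of the same three restricted samples.
restricted_a = sequence_probability([(1, 0, 0), (1, 0, 0), (0, 1, 0)], restricted=True)
restricted_b = sequence_probability([(1, 0, 0), (0, 1, 0), (1, 0, 0)], restricted=True)

# All orderings of an unrestricted sequence.
sequence = [(1, 0, 1), (1, 1, 0), (0, 0, 1)]
unrestricted_values = {
    sequence_probability(list(order), restricted=False) for order in permutations(sequence)
}
```

Here `restricted_a` is 2/99 while `restricted_b` is 1/42, so reordering the same samples changes the probability under direct restriction. In contrast, `unrestricted_values` contains a single value, as expected for an exchangeable sequence.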

This section shows that, while directly restricting the predictive distribu- 
tion of the IBP is appealing because it avoids instantiating the infinite latent 
measure, this construction does not yield an exchangeable distribution. Modifying
a Gibbs sampler for the IBP based on the directly restricted predictive
distribution would not yield a valid sampler for either the above model, or the
exchangeable model described in Section 3.1. For the remainder of the paper, we
focus on developing valid sampling schemes for the exchangeable model, which
we will refer to as a restricted IBP (rIBP).



4 Inference 

In this section, we focus on inference methods for restricted IBPs, since samplers
for the restricted iGaP can easily be obtained by modifying existing samplers
for the CRP.

We focus on sampling in a truncated model, where we approximate the
countably infinite sequence $\{\pi_k\}_{k=1}^{\infty}$ with a large, but finite, vector $\pi := (\pi_1, \ldots, \pi_K)$,
where each atom $\pi_k$ is distributed according to $\mathrm{Beta}(\alpha/K, 1)$. Conditioned on
$\pi$, we can evaluate the probability of a given matrix $Z$:

$$\mu_\pi^{|f}(Z) = \frac{\prod_{k=1}^{K} \pi_k^{m_k} (1 - \pi_k)^{N - m_k} \prod_{n=1}^{N} f(S_n)}{\prod_{n=1}^{N} \mathrm{PoiBin}(S_n \mid \pi)} \qquad (8)$$

where $S_n = \sum_k z_{nk}$ and $m_k = \sum_n z_{nk}$.

Let $g(X \mid Z)$ be the probability of the data given a binary matrix $Z$. If the
number of entries in each row is random and distributed according to $f$, then
we can Gibbs sample each entry of $Z$ according to

$$p(z_{nk} = 1 \mid x_n, \pi, Z_{\neg nk}, \textstyle\sum_{j \neq k} z_{nj} = a) \propto \pi_k \frac{f(a + 1)}{\mathrm{PoiBin}(a + 1 \mid \pi)}\, g(x_n \mid z_{nk} = 1, z_{n, \neg k}, Z_{\neg n})$$

$$p(z_{nk} = 0 \mid x_n, \pi, Z_{\neg nk}, \textstyle\sum_{j \neq k} z_{nj} = a) \propto (1 - \pi_k) \frac{f(a)}{\mathrm{PoiBin}(a \mid \pi)}\, g(x_n \mid z_{nk} = 0, z_{n, \neg k}, Z_{\neg n}) \qquad (9)$$

If the number of non-zero entries per row is fixed, we must resample the
location of the non-zero entries. Let $z_n^{(j)}$ indicate the location of the $j$th
non-zero entry of $z_n$. We can Gibbs sample $z_n^{(j)}$ according to

$$p(z_n^{(j)} = k \mid x_n, \pi, z_n^{(\neg j)}) \propto \frac{\pi_k}{1 - \pi_k}\, g(x_n \mid z_n^{(j)} = k, z_n^{(\neg j)}, Z_{\neg n}). \qquad (10)$$

Gibbs sampling alone can yield poor mixing, especially in the case where the
sum of each row is fixed. To alleviate this problem, we incorporate Metropolis-Hastings
moves that propose an entire row of $Z$.
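
The entry-wise Gibbs update for the random-row-sum case can be sketched as follows. This is a minimal illustration for a single row in the truncated model; the callable `likelihood` stands in for $g$ and, like the example values and function names, is a hypothetical placeholder rather than part of the paper.

```python
def poisson_binomial_pmf(probs):
    """Return [P(sum = s) for s in 0..K] for independent Bernoulli(probs[k]) trials."""
    pmf = [1.0]
    for p in probs:
        extended = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            extended[s] += mass * (1.0 - p)
            extended[s + 1] += mass * p
        pmf = extended
    return pmf


def gibbs_prob_one(k, row, probs, f, likelihood):
    """Conditional probability that entry k of the row equals one.

    row: current binary row (list of 0/1); probs: atom masses pi_1..pi_K;
    f: row-sum distribution; likelihood: placeholder for g, called on a candidate row.
    Assumes at least one of the two candidate rows has non-zero weight.
    """
    pmf = poisson_binomial_pmf(probs)
    others = sum(row) - row[k]  # a = sum over j != k of z_nj
    row_on = row[:k] + [1] + row[k + 1:]
    row_off = row[:k] + [0] + row[k + 1:]
    weight_on = probs[k] * f(others + 1) / pmf[others + 1] * likelihood(row_on)
    weight_off = (1.0 - probs[k]) * f(others) / pmf[others] * likelihood(row_off)
    return weight_on / (weight_on + weight_off)


probs = [0.6, 0.3, 0.2, 0.1]
row = [1, 0, 1, 0]
pmf = poisson_binomial_pmf(probs)
flat = lambda candidate: 1.0  # uninformative stand-in for g

# Sanity check: if f is the Poisson-binomial pmf itself, the restriction vanishes.
unrestricted = gibbs_prob_one(1, row, probs, lambda s: pmf[s], flat)
# A row-sum distribution that forbids more than two ones in a row.
at_most_two = gibbs_prob_one(1, row, probs, lambda s: 1.0 / 3.0 if s <= 2 else 0.0, flat)
```

With a flat likelihood, `unrestricted` recovers the atom mass 0.3, while `at_most_two` is zero because switching the entry on would give the row three ones.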

Conditioned on $Z$, the distribution of $\pi$ is described by

$$p(\pi \mid Z) \propto \frac{\nu(\pi \mid Z)}{\prod_{n=1}^{N} \mathrm{PoiBin}(S_n \mid \pi)} \qquad (11)$$

where $\nu(\pi \mid Z)$ is the posterior distribution over $\pi$ in the unrestricted finite model.
The Poisson-binomial term can be calculated exactly in $O(K \sum_k z_{nk})$ using either
a recursive algorithm [2][3] or an algorithm based on the characteristic function
that uses the Discrete Fourier Transform. It can also be approximated
using a skewed-normal approximation to the Poisson-binomial distribution [15].
We can therefore sample from the posterior of $\pi$ using Metropolis-Hastings steps.
Since we believe the posterior will be close to the posterior for the unrestricted
model, we use the proposal distribution $q(\pi_k \mid Z) = \mathrm{Beta}(\alpha/K + m_k, N + 1 - m_k)$
to propose new values of $\pi_k$.
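
One way to realise this step is sketched below. It relies on an observation that is our reading rather than a statement from the paper: because the proposal for $\pi_k$ coincides with the unrestricted posterior, the Beta terms cancel in the Metropolis-Hastings ratio and only the Poisson-binomial terms remain. The function names, the one-atom-at-a-time sweep, and the example matrix are illustrative assumptions.

```python
import math
import random


def poisson_binomial_pmf(probs):
    """Return [P(sum = s) for s in 0..K] for independent Bernoulli(probs[k]) trials."""
    pmf = [1.0]
    for p in probs:
        extended = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            extended[s] += mass * (1.0 - p)
            extended[s + 1] += mass * p
        pmf = extended
    return pmf


def log_poibin_term(probs, row_sums):
    """Sum over rows of log PoiBin(S_n | pi)."""
    pmf = poisson_binomial_pmf(probs)
    return sum(math.log(pmf[s]) for s in row_sums)


def mh_update_atoms(probs, z_matrix, alpha, rng):
    """One Metropolis-Hastings sweep over the atoms pi_1..pi_K given Z."""
    num_rows, num_atoms = len(z_matrix), len(probs)
    row_sums = [sum(row) for row in z_matrix]
    col_counts = [sum(row[k] for row in z_matrix) for k in range(num_atoms)]
    probs = list(probs)
    current = log_poibin_term(probs, row_sums)
    accepted = 0
    for k in range(num_atoms):
        proposal = list(probs)
        # Proposal: the posterior of pi_k in the unrestricted truncated model.
        proposal[k] = rng.betavariate(alpha / num_atoms + col_counts[k],
                                      num_rows + 1 - col_counts[k])
        proposed = log_poibin_term(proposal, row_sums)
        # Beta terms cancel, leaving the ratio of Poisson-binomial terms.
        if rng.random() < math.exp(min(0.0, current - proposed)):
            probs, current = proposal, proposed
            accepted += 1
    return probs, accepted


rng = random.Random(3)
z_matrix = [[1, 1, 0, 0, 0], [1, 0, 1, 0, 0], [0, 1, 1, 0, 0], [1, 1, 0, 0, 0]]
probs = [0.5] * 5
for _ in range(50):
    probs, accepted = mh_update_atoms(probs, z_matrix, alpha=2.0, rng=rng)
```

Each sweep recomputes the Poisson-binomial pmf once per proposed atom, which is affordable for small truncation levels; a production sampler would likely cache or update these terms incrementally.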

In certain cases, we may wish to directly evaluate the predictive distribution
$p^{|A}(z_{N+1} \mid z_1, \ldots, z_N)$. Unfortunately, in the case of the IBP, we are unable
to perform the integral in Equation 2 analytically. We can, however, estimate
the predictive distribution using importance sampling. We sample $T$ measures
$\pi^{(t)} \sim \nu(\pi \mid Z)$, where $\nu(\pi \mid Z)$ is the posterior distribution over $\pi$ in the finite
approximation to the IBP, and then weight them to obtain the restricted predictive
distribution

$$p^{|A}(z_{N+1} \mid z_1, \ldots, z_N) \approx \frac{\sum_{t=1}^{T} w_t\, \mu^{|A}_{\pi^{(t)}}(z_{N+1})}{\sum_{t=1}^{T} w_t}, \qquad (12)$$

where $w_t = \mu^{|A}_{\pi^{(t)}}(z_1, \ldots, z_N) / \mu_{\pi^{(t)}}(z_1, \ldots, z_N)$, and $\mu^{|A}_{\pi^{(t)}}(\cdot)$ is given by Equation 8.
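
A small sketch of this estimator for the truncated model is given below. It assumes self-normalised importance weights, and uses the fact that the ratio of restricted to unrestricted likelihoods reduces to $\prod_n f(S_n) / \mathrm{PoiBin}(S_n \mid \pi)$. The function names and the toy data are illustrative assumptions.

```python
import math
import random


def poisson_binomial_pmf(probs):
    """Return [P(sum = s) for s in 0..K] for independent Bernoulli(probs[k]) trials."""
    pmf = [1.0]
    for p in probs:
        extended = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            extended[s] += mass * (1.0 - p)
            extended[s + 1] += mass * p
        pmf = extended
    return pmf


def restricted_predictive(z_new, z_matrix, alpha, f, num_samples, rng):
    """Importance-sampling estimate of the restricted predictive probability of z_new."""
    num_rows, num_atoms = len(z_matrix), len(z_new)
    row_sums = [sum(row) for row in z_matrix]
    col_counts = [sum(row[k] for row in z_matrix) for k in range(num_atoms)]
    s_new = sum(z_new)
    weighted, total_weight = 0.0, 0.0
    for _ in range(num_samples):
        # pi^(t) drawn from the posterior of the unrestricted truncated model.
        probs = [rng.betavariate(alpha / num_atoms + m, num_rows + 1 - m)
                 for m in col_counts]
        pmf = poisson_binomial_pmf(probs)
        # w_t: restricted over unrestricted likelihood of the observed rows.
        weight = math.prod(f(s) / pmf[s] for s in row_sums)
        bernoulli = math.prod(p if bit else 1.0 - p for bit, p in zip(z_new, probs))
        weighted += weight * bernoulli * f(s_new) / pmf[s_new]
        total_weight += weight
    return weighted / total_weight


def exactly_two(s):
    """Hypothetical row-sum distribution: every row has exactly two ones."""
    return 1.0 if s == 2 else 0.0


z_matrix = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 0, 0]]
estimate = restricted_predictive([1, 1, 0, 0], z_matrix, 2.0, exactly_two,
                                 200, random.Random(0))
```

Rows whose sum has zero probability under $f$ receive a predictive probability of zero, and for a fixed set of sampled measures the estimates over all candidate rows sum to one.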



5 Experimental evaluation 

In this paper, we have described how distributions over exchangeable matrices, 
such as the IBP, can be modified to allow more flexible control on the distri- 
butions over the number of latent features, and described methods to perform 
inference in such models. In this section, we perform experiments on both real 
and synthetic data. The synthetic data experiments are designed to show that 
appropriate restriction can yield more interpretable features, and to explore 
which inference techniques are appropriate in which data regimes. The experi- 
ments on real data are designed to show that careful choice of the distribution 
over the number of latent features in our models can lead to improved predictive 
performance. 

5.1 Synthetic data 

We begin by evaluating the restricted IBP on synthetic image data. We gener- 
ated 50 images, consisting of two binary features selected at random from a set 
of four possible features, plus Gaussian noise. This experiment is a variant of 
an image analysis experiment performed in [9]. 

We tried to learn the latent features using two models: a single-parameter
IBP, and a single-parameter IBP restricted to have two features present in each
data point. In the restricted model, we alternately sampled $\pi$ and $Z$ as described
in Section 4; for the vanilla IBP we Gibbs sampled the $\pi_k$ in a truncated model.
In both cases we fixed $\alpha = 2$ and truncated the model to allow 100 features.
Both models were run for 10000 iterations.

Figure 2 shows the features recovered by both models, and some sample
image reconstructions. By incorporating prior knowledge about the number of 
features, the restricted model is able to find the expected features and achieve 
superior reconstructions. 

5.2 Classification of text data 

The IBP and its extensions have been used to directly model text data [13][12].
In such settings, the IBP is used to directly model the presence or absence of
words, and so the matrix is observed rather than latent, and the total number
of features is given by the vocabulary size. We hypothesise that the Poisson
assumption made by the IBP is not appropriate for text data, as the statistics
of word use in natural language tend to follow a heavier-tailed distribution [17].
To test this hypothesis, we modeled a collection of corpora using both an IBP,
and an IBP restricted to have heavier-tailed distributions over the number of
features in each row. Our corpora were 20 collections of newsgroup postings on
various topics (for example, comp.graphics, rec.autos, rec.sport.hockey).¹ To
evaluate the quality of the models, we classified held-out documents based on
their probability under each topic. This experiment is designed to replicate an
experiment performed by [12] to compare the original and three-parameter IBP
models.

Figure 2: Left: Generating features and sample images. Center/right: Features
and reconstructions learned using the IBP (center) and the IBP restricted to
have two features per data point (right).

For our restricted model, we chose a negative Binomial distribution over the 
number of words. For both the IBP and the rIBP we estimated the predictive 
distribution by generating 1000 samples from the posterior of the beta process 
in the IBP model. No pre-processing of the documents was performed. Since 
the vocabulary (and hence the feature space) is finite, we used finite versions of 
both the IBP and the rIBP. Due to the very large state space, we restricted our 
samples such that, in a single sample, atoms with the same posterior distribu- 
tion were assigned the same value. In the case of the IBP, we used these samples 
directly to estimate the predictive distribution; for the restricted model, we used 
the importance-weighted samples obtained using Equation 12. For each model,
$\alpha$ was set to the mean number of features per document in the corresponding
group, and the maximum likelihood parameters were used for the negative
Binomial distribution. For each model, we trained on 1000 randomly selected 
documents, and tested on a further 1000 documents. 

We evaluated the models by classifying the remaining documents based on
their likelihood under each of the 20 newsgroups. We looked at the fraction
correctly classified at $n$, i.e., for each $n = 1, \ldots, 20$ we looked at whether the
correct label is one of the $n$ most likely labels. Table 1 shows the fraction of
documents correctly classified in the first $n$ labels. The restricted IBP performs
uniformly better than the unrestricted model.
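
The "correct at $n$" metric can be computed as in the sketch below, which assumes a table of per-document scores (for example, log-likelihoods under each newsgroup model). The function name and the toy scores are hypothetical and are not the paper's data.

```python
def fraction_correct_at_n(scores, true_labels, n):
    """Fraction of documents whose true label is among the n highest-scoring labels.

    scores[d][c] is the score (e.g. log-likelihood) of document d under class c.
    """
    hits = 0
    for doc_scores, label in zip(scores, true_labels):
        ranked = sorted(range(len(doc_scores)), key=lambda c: doc_scores[c], reverse=True)
        if label in ranked[:n]:
            hits += 1
    return hits / len(true_labels)


# Toy example with three documents and three classes.
toy_scores = [[-1.0, -2.0, -3.0], [-3.0, -1.0, -2.0], [-2.0, -3.0, -1.0]]
toy_labels = [0, 2, 1]
curve = [fraction_correct_at_n(toy_scores, toy_labels, n) for n in (1, 2, 3)]
```

By construction the curve is non-decreasing in $n$ and reaches one when $n$ equals the number of classes.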

¹ http://people.csail.mit.edu/jrennie/20Newsgroups/
n      1      2      3      4      5
IBP    0.591  0.726  0.796  0.848  0.878
rIBP   0.622  0.749  0.819  0.864  0.918

Table 1: Proportion correct at n on classifying documents from the 20newsgroup
dataset.



6 Conclusion 

In this paper we have explored ways of relaxing the distributional assumptions 
made by existing exchangeable nonparametric processes. The resulting models 
allow us to specify a distribution over the number of features exhibited by each 
data point, permitting greater flexibility in model specification. As future work, 
we intend to explore which applications and models can most benefit from the 
distributional flexibility afforded by this class of models. 